library(data.table) # Efficient Dataframe
library(lubridate) # For Dates
library(tidyverse) # Multiple Package for Useful Data wrangling
library(esquisse) # Intuitive plotting
library(plyr) # Data splitting
library(dplyr) # Data Wrangling
library(ggplot2) # Plot Graphs
library(naniar) # for NA exploration in Dataframe
library(plotly) # Make ggplot2 Dynamic
library(gridExtra) # Multiple Plot at once
library(RColorBrewer) # For Color Palette
library(rmdformats) # Theme of HTML
library(flextable) # Show Table
library(class) # K-NN
library(summarytools) # Beautiful and Efficient Summary for Dataset
library(pivottabler) # Pivot Table
library(naivebayes) # Naive Bayes Function
library(caret) # Confusion Matrix

Those are the required packages.
If the company begins working with a new set of 1000 leads to sell the same services, similar to the 500 in the pilot study, without any use of predictive modeling to target sales efforts, what is the estimated profit?
Without any predictive modeling, we can roughly estimate the profit with the following formulas:
\[ Sales_{Estimated} = 1000*\$2500 = \$2'500'000 \]
But the company will also have expenditures related to its sales, which negatively impact the Total Profit. The sales effort would cost:
\[ Costs_{Estimated} = 1000*\$2500 = \$2'500'000 \]
Leading to a Total Profit of:
\[ TotalProfit_{Estimated} = Sales_{Estimated} - Costs_{Estimated} \\ = \$2'500'000 - \$2'500'000 = \$0 \]
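The same arithmetic can be checked with a short base-R sketch (the amounts are the ones assumed above):

```r
# Back-of-the-envelope check of the no-model scenario (figures from above)
leads            <- 1000
revenue_per_sale <- 2500  # revenue per lead, assuming every lead buys
cost_per_lead    <- 2500  # sales-effort cost per lead
sales_estimated  <- leads * revenue_per_sale
costs_estimated  <- leads * cost_per_lead
total_profit     <- sales_estimated - costs_estimated
total_profit  # 0
```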
If the firm wants the average profit on each sale to at least double the sales effort cost, and applies an appropriate cutoff with this predictive model to a new set of 1000 leads, how far down the new list of 1000 should it proceed (how many deciles)?
\[ Ratio_{Estimated} = \dfrac{2*\$2500}{\$2500} = 2 \]
If we want the average profit on each sale to double the sales effort cost, we should take only the first decile (10%) on the decile-wise lift chart, which is where the mean response is doubled.
Still considering the new list of 1000 leads, if the company applies this predictive model with a lower cutoff of $2500, how far should it proceed down the ranked leads, in terms of deciles?
We want the cutoff to be $2500:
\[ Ratio_{Estimated} = \dfrac{\$2500}{\$2500} = 1 \]
If we take a look at the decile-wise lift chart, we see that the mean response stays at or above 1 up to and including the 6th decile, so the company should proceed down through the first six deciles.
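For illustration, a decile-wise mean-response computation can be sketched in base R on simulated scores (the probabilities below are hypothetical, not the pilot-study model's):

```r
# Sketch of a decile-wise lift computation on simulated scores
set.seed(1)
prob   <- runif(500)                      # hypothetical predicted probabilities
actual <- rbinom(500, 1, prob)            # hypothetical 0/1 outcomes
ord    <- order(prob, decreasing = TRUE)  # rank leads from best to worst
decile <- rep(1:10, each = 50)            # 10 deciles of 50 leads each
mean_resp <- tapply(actual[ord], decile, mean)
lift <- mean_resp / mean(actual)          # decile mean response vs overall mean
barplot(lift, xlab = "Decile", ylab = "Mean response / overall mean")
```

Reading such a chart decile by decile is exactly how the cutoff questions above are answered.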
Why use this two-stage process for predicting sales, rather than simply developing a model for predicting profit for the 1000 new leads?
This two-stage process already gives managers actionable answers for the 1000-lead predictions: which customers to target and how far down the ranked list to go. The two charts provide fast insight into the sales predictions and into what is needed to meet the goals set by the board or company management.
## Data Frame Summary
## UniversalBank1
## Dimensions: 5000 x 14
## Duplicates: 0

| No | Variable | Type | Stats / Values | Freqs (% of Valid) | Valid | Missing |
|----|----------|------|----------------|--------------------|-------|---------|
| 1 | ID | integer | Mean (sd): 2500.5 (1443.5); min < med < max: 1 < 2500.5 < 5000; IQR (CV): 2499.5 (0.6) | 5000 distinct values (integer sequence) | 5000 (100.0%) | 0 (0.0%) |
| 2 | Age | integer | Mean (sd): 45.3 (11.5); min < med < max: 23 < 45 < 67; IQR (CV): 20 (0.3) | 45 distinct values | 5000 (100.0%) | 0 (0.0%) |
| 3 | Experience | integer | Mean (sd): 20.1 (11.5); min < med < max: -3 < 20 < 43; IQR (CV): 20 (0.6) | 47 distinct values | 5000 (100.0%) | 0 (0.0%) |
| 4 | Income | integer | Mean (sd): 73.8 (46); min < med < max: 8 < 64 < 224; IQR (CV): 59 (0.6) | 162 distinct values | 5000 (100.0%) | 0 (0.0%) |
| 5 | ZIP Code | integer | Mean (sd): 93152.5 (2121.9); min < med < max: 9307 < 93437 < 96651; IQR (CV): 2697 (0) | 467 distinct values | 5000 (100.0%) | 0 (0.0%) |
| 6 | Family | integer | Mean (sd): 2.4 (1.1); min < med < max: 1 < 2 < 4; IQR (CV): 2 (0.5) | 1: 1472 (29.4%), 2: 1296 (25.9%), 3: 1010 (20.2%), 4: 1222 (24.4%) | 5000 (100.0%) | 0 (0.0%) |
| 7 | CCAvg | numeric | Mean (sd): 1.9 (1.7); min < med < max: 0 < 1.5 < 10; IQR (CV): 1.8 (0.9) | 108 distinct values | 5000 (100.0%) | 0 (0.0%) |
| 8 | Education | integer | Mean (sd): 1.9 (0.8); min < med < max: 1 < 2 < 3; IQR (CV): 2 (0.4) | 1: 2096 (41.9%), 2: 1403 (28.1%), 3: 1501 (30.0%) | 5000 (100.0%) | 0 (0.0%) |
| 9 | Mortgage | integer | Mean (sd): 56.5 (101.7); min < med < max: 0 < 0 < 635; IQR (CV): 101 (1.8) | 347 distinct values | 5000 (100.0%) | 0 (0.0%) |
| 10 | Personal Loan | integer | Min: 0; Mean: 0.1; Max: 1 | 0: 4520 (90.4%), 1: 480 (9.6%) | 5000 (100.0%) | 0 (0.0%) |
| 11 | Securities Account | integer | Min: 0; Mean: 0.1; Max: 1 | 0: 4478 (89.6%), 1: 522 (10.4%) | 5000 (100.0%) | 0 (0.0%) |
| 12 | CD Account | integer | Min: 0; Mean: 0.1; Max: 1 | 0: 4698 (94.0%), 1: 302 (6.0%) | 5000 (100.0%) | 0 (0.0%) |
| 13 | Online | integer | Min: 0; Mean: 0.6; Max: 1 | 0: 2016 (40.3%), 1: 2984 (59.7%) | 5000 (100.0%) | 0 (0.0%) |
| 14 | CreditCard | integer | Min: 0; Mean: 0.3; Max: 1 | 0: 3530 (70.6%), 1: 1470 (29.4%) | 5000 (100.0%) | 0 (0.0%) |
We can see that most of the variables are of type “integer”, except CCAvg, which is of type “numeric”. There is no missing data in any variable. We can see this better with a plot:
The following code sets the seed to 1 and partitions the dataset into two sets, training and validation.
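As a minimal sketch of such a missingness check (the toy data frame here is hypothetical; in the report the same calls would run on UniversalBank1):

```r
# Count NAs per column; the same pattern applies to UniversalBank1
toy <- data.frame(Age = c(23, NA, 45), Income = c(8, 64, 224))
colSums(is.na(toy))
# With naniar loaded, gg_miss_var(UniversalBank1) plots missingness per variable
```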
# Setting Seed
set.seed(1)
# Splitting Training and Validation
sample <- sample(c(TRUE, FALSE), nrow(UniversalBank1), replace=TRUE, prob=c(0.6,0.4))
training <- UniversalBank1[sample, ]
validation <- UniversalBank1[!sample, ]
# Checking if proportions are right
train_prop <- dim(training)
validation_prop <- dim(validation)
train_prop_100 <- (train_prop[1]/nrow(UniversalBank1))*100
validation_prop_100 <- (validation_prop[1]/nrow(UniversalBank1))*100
paste(train_prop_100,"% In Training",validation_prop_100,"% In Validation")
## [1] "61 % In Training 39 % In Validation"
Here is the confirmation of the actual percentages in each set after the partition process.
Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education_1 = 0, Education_2 = 1, Education_3 = 0, Mortgage = 0, Securities Account = 0, CD Account = 0, Online = 1, and Credit Card = 1.
Perform a K-NN classification with all predictors except ID and ZIP Code, using K = 1.
# Setting Seed
set.seed(1)
# Removing Some Predictors
training <- training[,-c("ID","ZIP Code")]
validation <- validation[,-c("ID","ZIP Code")]
# Target Variable As Factor
training$`Personal Loan` <- factor(training$`Personal Loan`, levels = c(0,1),labels = c("No Loan","Loan"))
validation$`Personal Loan` <- factor(validation$`Personal Loan`, levels = c(0,1),labels = c("No Loan","Loan"))
# Education As Factor
training$Education <- as.factor(training$Education)
validation$Education <- as.factor(validation$Education)
# Education One-Hot Encoding
Education_As_Dummy_Training <- model.matrix(~0+training$Education)
Education_As_Dummy_Validation <- model.matrix(~0+validation$Education)
# Append to Training and Validation Sets
training <- cbind(training,Education_As_Dummy_Training)
training <- training[,-c("Education")]
validation <- cbind(validation,Education_As_Dummy_Validation)
validation <- validation[,-c("Education")]
# Renaming Education
training = training %>% rename( Education_1 = `training$Education1` , Education_2 = `training$Education2`, Education_3 = `training$Education3`)
validation = validation %>% rename( Education_1 = `validation$Education1` , Education_2 = `validation$Education2`, Education_3 = `validation$Education3`)
# Preprocess for Data Normalization
training_norm <- training
validation_norm <- validation
training_norm_s <- training[,-c("Personal Loan")]
norm_values <- preProcess(training_norm_s,method = c("center","scale"))
training_norm <- predict(norm_values,training)
validation_norm <- predict(norm_values,validation)
# KNN Model using class package
library(class)
# Data frame for a specific customer not in Data
Customer_Test <- data.frame("Age"=40,"Experience"=10,"Income"=84,"Family"=2,"CCAvg"=2,"Mortgage"=0,"Securities Account"=0,"CD Account"=0,"Online"=1,"CreditCard"=1,"Education_1"=0,"Education_2"=1,"Education_3"=0, check.names=FALSE)
# Preprocess the Customer New Data
Customer_Test_norm <- predict(norm_values, Customer_Test)
## KNN Training for Customer
predictions_customer <- knn(train=training_norm[,-c("Personal Loan")],test = Customer_Test_norm, cl = training_norm$`Personal Loan`, k=1)
# Append Predictions to Customer not in Data
Customer_Test$Predicted <- predictions_customer
# Table Customer after Normalization
flextable(head(Customer_Test_norm)) %>%
fontsize(size = 7, part = "all")

| Age | Experience | Income | Family | CCAvg | Mortgage | Securities Account | CD Account | Online | CreditCard | Education_1 | Education_2 | Education_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -0.4566169 | -0.8659041 | 0.199656 | -0.3477529 | 0.01439031 | -0.5640238 | -0.3523862 | -0.264863 | 0.8253143 | 1.51897 | -0.8658006 | 1.643341 | -0.6565908 |
# Table Customer after Predictions
flextable(head(Customer_Test)) %>%
fontsize(size = 7, part = "all")

| Age | Experience | Income | Family | CCAvg | Mortgage | Securities Account | CD Account | Online | CreditCard | Education_1 | Education_2 | Education_3 | Predicted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 40 | 10 | 84 | 2 | 2 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | No Loan |
This customer would be classified as not getting a Personal Loan (“No Loan”) by K-NN with K = 1.
We want to test multiple values of K so that the K with the best accuracy between the training and validation sets is chosen (a cross-validation procedure).
# Setting Seed
set.seed(1)
# Load Caret
library(caret)
# Number of iterations
max_iterations = 30
# Dataframe with 2 columns: k and accuracy
accuracy.df <- data.frame(k=seq(1,max_iterations,1),accuracy=rep(0,max_iterations))
# Compute K-NN for different k on validation
for(i in 1:max_iterations){
# Testing K-NN
knn.prediction <- knn(train = training_norm[,-c("Personal Loan")], test=validation_norm[,-c("Personal Loan")] , cl=training_norm$`Personal Loan`, k=i)
# Storing into the accuracy.df results
accuracy.df[i,2] <- confusionMatrix(knn.prediction, validation$`Personal Loan`)$overall[1]
}
# Table of Accuracy
flextable(accuracy.df) %>% fontsize(size = 12, part = "all")

| k | accuracy |
|---|-----------|
| 1 | 0.9676923 |
| 2 | 0.9574359 |
| 3 | 0.9641026 |
| 4 | 0.9574359 |
| 5 | 0.9564103 |
| 6 | 0.9533333 |
| 7 | 0.9538462 |
| 8 | 0.9507692 |
| 9 | 0.9507692 |
| 10 | 0.9476923 |
| 11 | 0.9487179 |
| 12 | 0.9471795 |
| 13 | 0.9466667 |
| 14 | 0.9451282 |
| 15 | 0.9456410 |
| 16 | 0.9446154 |
| 17 | 0.9425641 |
| 18 | 0.9415385 |
| 19 | 0.9415385 |
| 20 | 0.9425641 |
| 21 | 0.9420513 |
| 22 | 0.9415385 |
| 23 | 0.9400000 |
| 24 | 0.9394872 |
| 25 | 0.9405128 |
| 26 | 0.9389744 |
| 27 | 0.9389744 |
| 28 | 0.9358974 |
| 29 | 0.9369231 |
| 30 | 0.9374359 |
# Ploting the K and accuracy together
ggplotly(
ggplot(accuracy.df) +
aes(x = k, y = accuracy) +
geom_line(size = 0.5, colour = "#1c6155") +
labs(x = "Number of K",
y = "Accuracy (Between Training and Validation)", title = "K-NN Accuracy regarding parameter K") +
theme_minimal()
)
# Choosing Efficient K
highest_K <- which.max(accuracy.df$accuracy)
print(paste("Best K for Highest Accuracy is",highest_K))
## [1] "Best K for Highest Accuracy is 1"
We can see from this cross-validation that, overall, the higher the K, the lower the accuracy of the model.
# Setting Seed
set.seed(1)
# Computing Confusion Matrix with Best K
predictions_k <- knn(train=training_norm[,-c("Personal Loan")],test = validation_norm[,-c("Personal Loan")], cl = training_norm$`Personal Loan`, highest_K)
# Confusion Matrix
Confusion_Matrix_k <- confusionMatrix(data = predictions_k, reference = validation$`Personal Loan`)
# Plotting Matrix Function (In the References)
draw_confusion_matrix <- function(cm) {
layout(matrix(c(1,1,2)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
title('CONFUSION MATRIX', cex.main=2)
# create the matrix
rect(150, 430, 240, 370, col='#1c6155')
text(195, 435, 'No Loan', cex=1.2)
rect(250, 430, 340, 370, col='#1c615570')
text(295, 435, 'Loan', cex=1.2)
text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
text(245, 450, 'Actual', cex=1.3, font=2)
rect(150, 305, 240, 365, col='#1c615570')
rect(250, 305, 340, 365, col='#1c6155')
text(140, 400, 'No Loan', cex=1.2, srt=90)
text(140, 335, 'Loan', cex=1.2, srt=90)
# add in the cm results
res <- as.numeric(cm$table)
text(195, 400, res[1], cex=1.6, font=2, col='white')
text(195, 335, res[2], cex=1.6, font=2, col='white')
text(295, 400, res[3], cex=1.6, font=2, col='white')
text(295, 335, res[4], cex=1.6, font=2, col='white')
# add in the specifics
plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)
# add in the accuracy information
text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}
#Plot the Confusion Matrix
draw_confusion_matrix(Confusion_Matrix_k)
This Confusion Matrix has an Accuracy of 0.968 and a Specificity of 0.738, lower than the Sensitivity. Kappa is equal to 0.793. Among all the Confusion Matrices, this one has the highest Specificity and Kappa. It also has the highest F1, although very close to that of the last Confusion Matrix. Since Specificity is the True Negative rate, we can see that this model is not that good at predicting “Loan” when the customer actually has a loan: 48 customers were predicted “No Loan” while they actually had a loan, hence the following ratio:
\[ \dfrac{135}{135+48} = 0.738 = Specificity = True \ \ Negative \ \ Rate \]
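The ratio can be verified directly from the matrix counts (note that caret treats “No Loan”, the first factor level, as the positive class, so the 48 misclassified loans count as false positives here):

```r
# Specificity (true negative rate) from the confusion-matrix counts above
TN <- 135  # actual "Loan" predicted "Loan"
FP <- 48   # actual "Loan" predicted "No Loan"
round(TN / (TN + FP), 3)  # 0.738
```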
Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education_1 = 0, Education_2 = 1, Education_3 = 0, Mortgage = 0, Securities Account = 0, CD Account = 0, Online = 1, and Credit Card = 1.
# Setting Seed
set.seed(1)
# KNN Model on a specific customer not in Data
Customer_Test_2 <- data.frame("Age"=40,"Experience"=10,"Income"=84,"Family"=2,"CCAvg"=2,"Mortgage"=0,"Securities Account"=0,"CD Account"=0,"Online"=1,"CreditCard"=1,"Education_1"=0,"Education_2"=1,"Education_3"=0, check.names=FALSE)
# Preprocess the Customer New Data
Customer_Test_2_norm <- predict(norm_values, Customer_Test_2)
## KNN Training for Customer
predictions_customer_2 <- knn(train=training_norm[,-c("Personal Loan")],test = Customer_Test_2_norm, cl = training_norm$`Personal Loan`, k=highest_K)
# Append Predictions to Customer not in Data
Customer_Test_2$Predicted <- predictions_customer_2
# Table Customer after Predictions
flextable(head(Customer_Test_2)) %>%
fontsize(size = 7, part = "all")

| Age | Experience | Income | Family | CCAvg | Mortgage | Securities Account | CD Account | Online | CreditCard | Education_1 | Education_2 | Education_3 | Predicted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 40 | 10 | 84 | 2 | 2 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | No Loan |
This Customer would also be classified as not getting a Loan (No Loan).
Now we want to partition the data in the following proportions: 50% in Training, 30% in Validation and 20% in Testing.
# Setting Seed
set.seed(1)
# Splitting Training and Validation and Test
splitting <- sample(1:3,size=nrow(UniversalBank1),replace=TRUE,prob=c(0.5,0.3,0.2))
train <- UniversalBank1[splitting==1,]
valid <- UniversalBank1[splitting==2,]
test <- UniversalBank1[splitting==3,]
# Checking if proportions are right
Prop_train <- (nrow(train)/nrow(UniversalBank1))*100
Prop_valid <- (nrow(valid)/nrow(UniversalBank1))*100
Prop_test <- (nrow(test)/nrow(UniversalBank1))*100
# Print Proportion
paste(Prop_train,"% In Training",Prop_valid,"% In Validation",Prop_test,"% In Test")
## [1] "51.28 % In Training 27.74 % In Validation 20.98 % In Test"
We check the partition result.
Now we prepare the data sets and normalize them before making predictions with K-NN.
# Setting Seed
set.seed(1)
# Removing Some Predictors
train <- train[,-c("ID","ZIP Code")]
valid <- valid[,-c("ID","ZIP Code")]
test <- test[,-c("ID","ZIP Code")]
# Target Variable As Factor
train$`Personal Loan` <- factor(train$`Personal Loan`, levels = c(0,1), labels=c("No Loan","Loan"))
valid$`Personal Loan` <- factor(valid$`Personal Loan`,levels = c(0,1), labels=c("No Loan","Loan"))
test$`Personal Loan` <- factor(test$`Personal Loan`,levels = c(0,1), labels=c("No Loan","Loan"))
# Education As Factor
train$Education <- as.factor(train$Education)
valid$Education <- as.factor(valid$Education)
test$Education <- as.factor(test$Education)
# Education One-Hot Encoding
Education_As_Dummy_Train <- model.matrix(~0+train$Education)
Education_As_Dummy_Valid <- model.matrix(~0+valid$Education)
Education_As_Dummy_Test <- model.matrix(~0+test$Education)
# Append to Training and Validation Sets
train <- cbind(train,Education_As_Dummy_Train)
train <- train[,-c("Education")]
valid <- cbind(valid,Education_As_Dummy_Valid)
valid <- valid[,-c("Education")]
test <- cbind(test,Education_As_Dummy_Test)
test <- test[,-c("Education")]
# Renaming Education
train = train %>% rename( Education_1 = `train$Education1` , Education_2 = `train$Education2`, Education_3 = `train$Education3`)
valid = valid %>% rename( Education_1 = `valid$Education1` , Education_2 = `valid$Education2`, Education_3 = `valid$Education3`)
test = test %>% rename( Education_1 = `test$Education1` , Education_2 = `test$Education2`, Education_3 = `test$Education3`)
# Preprocess for Data Normalization
library(caret)
train_norm <- train
valid_norm <- valid
test_norm <- test
train_norm_s <- train[,-c("Personal Loan")]
norm_values_2 <- preProcess(train_norm_s,method = c("center","scale"))
train_norm <- predict(norm_values_2,train)
valid_norm <- predict(norm_values_2,valid)
test_norm <- predict(norm_values_2,test)
Confusion Matrix for Train VS Valid
# Train VS Valid
# Setting Seed
set.seed(1)
# Computing Confusion Matrix with Best K
predictions_k_new <- knn(train=train_norm[,-c("Personal Loan")],test = valid_norm[,-c("Personal Loan")], cl = train_norm$`Personal Loan`, highest_K)
# As Factor Predictions
predictions_k_new <- as.factor(predictions_k_new)
# Confusion Matrix
Confusion_Matrix_k_New <- confusionMatrix(data = predictions_k_new, reference = valid$`Personal Loan`)
#Plot the Confusion Matrix
draw_confusion_matrix(Confusion_Matrix_k_New)
The Specificity is 0.669, lower than the Sensitivity (0.99): the model does poorly on the True Negatives (“Loan”), and worse than the first Confusion Matrix (0.669 < 0.738).
\[ \dfrac{87}{87+43} = 0.669 = Specificity = True \ \ Negative \ \ Rate \]
Confusion Matrix for Train VS Test
# Train VS Test
# Setting Seed
set.seed(1)
# Computing Confusion Matrix with Best K
predictions_k_new2 <- knn(train=train_norm[,-c("Personal Loan")],test = test_norm[,-c("Personal Loan")], cl = train_norm$`Personal Loan`, highest_K)
# As Factor Predictions
predictions_k_new2 <- as.factor(predictions_k_new2)
# Confusion Matrix
Confusion_Matrix_k_New2 <- confusionMatrix(data = predictions_k_new2, reference = test$`Personal Loan`)
#Plot the Confusion Matrix
draw_confusion_matrix(Confusion_Matrix_k_New2)
The Specificity is higher (0.716) than in the previous Confusion Matrix (0.669, Train VS Valid), but still lower than the Sensitivity. The Accuracy is a bit higher than in the previous Confusion Matrix (0.965 > 0.96, Train VS Valid). The Specificity is still lower than in our first Confusion Matrix (0.716 < 0.738). There is more agreement in this Confusion Matrix than in the previous one (Kappa: 0.767 > 0.735).
\[ \dfrac{68}{68+27} = 0.716 = Specificity = True \ \ Negative \ \ Rate \]
Since all Confusion Matrices give a low Specificity (lower than 0.8), I would not rely on this model too much when it comes to predicting customers who take a loan. This K-NN model predicts “No Loan” correctly more often than “Loan”.
UniversalBank2 <- fread("DATA/UniversalBank.csv")
# Setting Seed
set.seed(1)
# Splitting Training and Validation
sample2 <- sample(c(TRUE, FALSE), nrow(UniversalBank2), replace=TRUE, prob=c(0.6,0.4))
training_8 <- UniversalBank2[sample2, ]
validation_8 <- UniversalBank2[!sample2, ]
# Checking if proportions are right
training_8_prop <- (nrow(training_8)/nrow(UniversalBank2))*100
validation_8_prop <- (nrow(validation_8)/nrow(UniversalBank2))*100
paste(training_8_prop,"% In Training",validation_8_prop,"% In Validation")
## [1] "61 % In Training 39 % In Validation"
We can check the partition after the sampling code.
# Duplicata of Training Data for Pivot Data
pivot_data <- training_8
# As Factor
pivot_data$Online <- factor(pivot_data$Online,levels = c(0,1),labels=c("Inactive Online","Active Online"))
pivot_data$CreditCard <- factor(pivot_data$CreditCard,levels = c(0,1),labels=c("No Credit Card","Credit Card"))
pivot_data$`Personal Loan` <- factor(pivot_data$`Personal Loan`,levels = c(0,1),labels=c("No Personal Loan","Personal Loan"))
# Pivot Table
pt <- PivotTable$new()
pt$addData(pivot_data)
pt$addColumnDataGroups("Online")
pt$addRowDataGroups("CreditCard")
pt$addRowDataGroups("Personal Loan")
pt$defineCalculation(calculationName="Total", summariseExpression="n()")
pt$renderPivot()
Here is the first pivot table, with Credit Card and Personal Loan as rows and Online as columns.
Using Bayes Theorem
\[\small P(Loan=1 | CC=1 \cap Online=1) = \\ \small \dfrac{54}{506+54} =\dfrac{54}{560} = 0.09642857 = 9.64 \%\]
Thus, there is a 9.64% probability that this kind of customer would accept the loan offer.
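The same pivot-table arithmetic, as a quick base-R check using the counts above:

```r
# Exact (full) Bayes estimate from the pivot-table counts
loan_cc_online    <- 54   # Loan = 1 among CC = 1 & Online = 1
no_loan_cc_online <- 506  # Loan = 0 among CC = 1 & Online = 1
p_exact <- loan_cc_online / (loan_cc_online + no_loan_cc_online)
round(p_exact, 4)  # 0.0964
```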
# Pivot Table 1
pt1 <- PivotTable$new()
pt1$addData(pivot_data)
pt1$addColumnDataGroups("Online")
pt1$addRowDataGroups("Personal Loan")
pt1$defineCalculation(calculationName="Total", summariseExpression="n()")
pt1$renderPivot()
# Pivot Table 2
pt2 <- PivotTable$new()
pt2$addData(pivot_data)
pt2$addColumnDataGroups("CreditCard")
pt2$addRowDataGroups("Personal Loan")
pt2$defineCalculation(calculationName="Total", summariseExpression="n()")
pt2$renderPivot()
Here are the two pivot tables, one comparing Credit Card vs Loan and the other Online vs Loan.
\[\small P (CC=1 | Loan=1) = \\ \small \dfrac{P(Loan=1|CC=1)*P(CC=1)}{P(Loan=1)} = \\ \small \dfrac{95}{297} = 0.3198653 = 31.99 \%\]
\[\small P (Online=1 | Loan=1) = \\ \small \dfrac{P(Loan=1|Online=1)*P(Online=1)}{P(Loan=1)} = \\ \small \dfrac{188}{297}= 0.6329966 = 63.30 \%\]
\[\small P (Loan=1) = \dfrac{297}{3050} = 0.09737705 = 9.74 \% \]
\[\small P (CC=1 | Loan=0) = \\ \small \dfrac{P(Loan=0|CC=1)*P(CC=1)}{P(Loan=0)} = \\ \small \dfrac{827}{2753} = 0.3003996 = 30.04 \%\]
\[\small P (Online=1 | Loan=0) = \\ \small \dfrac{P(Loan=0|Online=1)*P(Online=1)}{P(Loan=0)} =\\ \small \dfrac{1626}{2753} = 0.5906284 = 59.06 \%\]
\[\small P (Loan=0) = \dfrac{2753}{3050} = 0.902623 = 90.26\%\]
\[\small P(Loan=1|CC=1,Online=1)\]
Using the naive Bayes probability gives us the following computation:
\[\small P(Loan=1|CC=1,Online=1) = \\ \small \dfrac{P(Loan=1)*P(CC=1|Loan=1)*P(Online=1|Loan=1)}{P(CC=1)*P(Online=1)}=\\ \small \dfrac{0.09737705*0.3198653*0.6329966}{0.3022951*0.5947541} = 0.1096621 = 10.97\% \]
The naive approach gives us 10.97%, while the complete Bayes probability gives us 9.64%.
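As a check, the naive estimate can be reassembled in base R from the component probabilities derived above:

```r
# Naive Bayes estimate assembled from the reported component probabilities
p_loan1 <- 0.09737705  # P(Loan = 1)
p_cc_l1 <- 0.3198653   # P(CC = 1 | Loan = 1)
p_on_l1 <- 0.6329966   # P(Online = 1 | Loan = 1)
p_cc1   <- 0.3022951   # P(CC = 1)
p_on1   <- 0.5947541   # P(Online = 1)
p_naive <- p_loan1 * p_cc_l1 * p_on_l1 / (p_cc1 * p_on1)
round(p_naive, 4)  # 0.1097
```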
\[P(Loan=1|CC=1,Online=1)\]
# As factor for Loan
training_8$Online <- factor(training_8$Online,levels = c(0,1),labels=c("Inactive Online","Active Online"))
training_8$CreditCard <- factor(training_8$CreditCard,levels = c(0,1),labels=c("No Credit Card","Credit Card"))
training_8$`Personal Loan` <- factor(training_8$`Personal Loan`,levels = c(0,1),labels=c("No Personal Loan","Personal Loan"))
Naivebayes <- naive_bayes(training_8$`Personal Loan` ~ training_8$CreditCard + training_8$Online, data=training_8)
summary(Naivebayes)
##
## ================================== Naive Bayes ==================================
##
## - Call: naive_bayes.formula(formula = training_8$`Personal Loan` ~ training_8$CreditCard + training_8$Online, data = training_8)
## - Laplace: 0
## - Classes: 2
## - Samples: 3050
## - Features: 2
## - Conditional distributions:
## - Bernoulli: 2
## - Prior probabilities:
## - No Personal Loan: 0.9026
## - Personal Loan: 0.0974
##
## ---------------------------------------------------------------------------------
Naivebayes
##
## ================================== Naive Bayes ==================================
##
## Call:
## naive_bayes.formula(formula = training_8$`Personal Loan` ~ training_8$CreditCard +
## training_8$Online, data = training_8)
##
## ---------------------------------------------------------------------------------
##
## Laplace smoothing: 0
##
## ---------------------------------------------------------------------------------
##
## A priori probabilities:
##
## No Personal Loan Personal Loan
## 0.90262295 0.09737705
##
## ---------------------------------------------------------------------------------
##
## Tables:
##
## ---------------------------------------------------------------------------------
## ::: training_8$CreditCard (Bernoulli)
## ---------------------------------------------------------------------------------
##
## training_8$CreditCard No Personal Loan Personal Loan
## No Credit Card 0.6996004 0.6801347
## Credit Card 0.3003996 0.3198653
##
## ---------------------------------------------------------------------------------
## ::: training_8$Online (Bernoulli)
## ---------------------------------------------------------------------------------
##
## training_8$Online No Personal Loan Personal Loan
## Inactive Online 0.4093716 0.3670034
## Active Online 0.5906284 0.6329966
##
## ---------------------------------------------------------------------------------
\[\small P(Loan=1|CC=1,Online=1) = \\ \small \dfrac{P(Loan=1)*P(CC=1|Loan=1)*P(Online=1|Loan=1)}{P(CC=1)*P(Online=1)}=\\ \small \dfrac{0.09737705*0.3198653*0.6329966}{0.3022951*0.5947541} = 0.1096621 = 10.97\% \]

> The Naive Bayes model and question e) above give us the same estimated probability. We can see a difference between the naive approach and the complete Bayes probability: Naive Bayes assumes conditional independence whereas the complete Bayes theorem does not, which is why the estimates differ. Such a strong conditional independence assumption may not hold every time. For instance, having a credit card may be correlated with being online, in that the credit card grants special access to online services that the customer wanted, hence the purchase of the credit card. The naive approach is faster to compute, but Bayesian Network Classifiers might be better since they let you specify which features are independent of which.
Github Repo for this Homework 2
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R
How to Split Data into Training & Test Sets in R (3 Methods)
R how to visualize confusion matrix using the caret package
Friedman, N., Geiger, D., Provan, G., Langley, P. and Smyth, P. (1997). Bayesian Network Classifiers